100 research outputs found
Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction
We tackle image question answering (ImageQA) problem by learning a
convolutional neural network (CNN) with a dynamic parameter layer whose weights
are determined adaptively based on questions. For the adaptive parameter
prediction, we employ a separate parameter prediction network, which consists
of gated recurrent unit (GRU) taking a question as its input and a
fully-connected layer generating a set of candidate weights as its output.
However, it is challenging to construct a parameter prediction network for a
large number of parameters in the fully-connected dynamic parameter layer of
the CNN. We reduce the complexity of this problem by incorporating a hashing
technique, where the candidate weights given by the parameter prediction
network are selected using a predefined hash function to determine individual
weights in the dynamic parameter layer. The proposed network---joint network
with the CNN for ImageQA and the parameter prediction network---is trained
end-to-end through back-propagation, where its weights are initialized using a
pre-trained CNN and GRU. The proposed algorithm illustrates the
state-of-the-art performance on all available public ImageQA benchmarks
Weakly Supervised Action Localization by Sparse Temporal Pooling Network
We propose a weakly supervised temporal action localization algorithm on
untrimmed videos using convolutional neural networks. Our algorithm learns from
video-level class labels and predicts temporal intervals of human actions with
no requirement of temporal localization annotations. We design our network to
identify a sparse subset of key segments associated with target actions in a
video using an attention module and fuse the key segments through adaptive
temporal pooling. Our loss function is comprised of two terms that minimize the
video-level action classification error and enforce the sparsity of the segment
selection. At inference time, we extract and score temporal proposals using
temporal class activations and class-agnostic attentions to estimate the time
intervals that correspond to target actions. The proposed algorithm attains
state-of-the-art results on the THUMOS14 dataset and outstanding performance on
ActivityNet1.3 even with its weak supervision.Comment: Accepted to CVPR 201
Context-Aware Zero-Shot Recognition
We present a novel problem setting in zero-shot learning, zero-shot object
recognition and detection in the context. Contrary to the traditional zero-shot
learning methods, which simply infers unseen categories by transferring
knowledge from the objects belonging to semantically similar seen categories,
we aim to understand the identity of the novel objects in an image surrounded
by the known objects using the inter-object relation prior. Specifically, we
leverage the visual context and the geometric relationships between all pairs
of objects in a single image, and capture the information useful to infer
unseen categories. We integrate our context-aware zero-shot learning framework
into the traditional zero-shot learning techniques seamlessly using a
Conditional Random Field (CRF). The proposed algorithm is evaluated on both
zero-shot region classification and zero-shot detection tasks. The results on
Visual Genome (VG) dataset show that our model significantly boosts performance
with the additional visual context compared to traditional methods
- β¦